- From the README:
-
- Rel is a program that determines the relevance of text documents to a
- set of keywords expressed in boolean infix notation. The list of file
- names that are relevant are printed to the standard output, in order
- of relevance.
-
- For example, the command:
-
- rel "(directory & listing)" /usr/share/man/cat1
-
- (i.e., find the relevance of all files that contain both of the words
- "directory" and "listing" in the catman directory) will list 21 of
- the 782 catman files (totaling 6.8 MB), of which "ls.1" is the fifth
- most relevant. In other words, to find the command that lists
- directories in a Unix system, the "literature search" was cut, on
- average, from 359 files to 5, a reduction of approximately 98%. The
- command took 55 seconds to execute on a System V, rel. 4.2 machine
- (20 MHz 386 with an 18 ms ESDI drive), which is considerably faster
- than browsing through the files in the directory, since ls.1 is the
- 359th file in the directory. Although this example is elementary, a
- similar saving can be demonstrated in searching for documents in
- email repositories and text archives.
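-
- The reduction quoted above can be checked with a little arithmetic.
- The following is a small illustrative sketch in Python, not part of
- rel; the function name reduction_percent is hypothetical, and the
- figures are the ones quoted in the example.

```python
def reduction_percent(files_before, files_after):
    """Percentage of files that no longer need to be browsed."""
    return 100.0 * (files_before - files_after) / files_before

# ls.1 is the 359th file in the directory, but the 5th file listed by rel.
saved = reduction_percent(359, 5)

# Share of the 782 catman files that matched the query at all.
matched = 100.0 * 21 / 782

print(f"search reduced by {saved:.1f}%, {matched:.1f}% of files matched")
```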
-
- Additional applications include information robots (i.e., "mailbots"
- or "infobots"), where the disposition (i.e., delivery, filing, or
- viewing) of text documents can be determined dynamically, based on
- the relevance of the document to a set of criteria framed in boolean
- infix notation. In other words, the program can be used to order, or
- rank, text documents based on a "context" specified in a general
- mathematical language, similar to that used in calculators.
-
- General description of the program:
-
- This program is an experiment to evaluate using infix boolean
- operations as a heuristic to determine the relevance of text files in
- electronic literature searches. The operators supported are "&" for
- logical "and," "|" for logical "or," and "!" for logical "not."
- Parentheses are used as grouping operators, and "partial key" searches
- are fully supported (meaning that the words can be abbreviated). For
- example, the command:
-
- rel "(((these & those) | (them & us)) ! we)" file1 file2 ...
-
- would print, from the list of filenames file1, file2, ..., those files
- that contain either the words "these" and "those", or "them" and "us",
- but do not contain the word "we". The file names are printed in order
- of relevance, where relevance is determined by the number of
- occurrences of the words "these", "those", "them", and "us" in each
- file. The general concept is to "narrow down" the number of files to
- be browsed when doing electronic literature searches for specific
- words and phrases in a group of files, using a command similar to:
-
- more `rel "(((these & those) | (them & us)) ! we)" file1 file2`
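-
- The evaluation described above can be sketched in a few lines. This
- is an illustrative Python sketch under stated assumptions, not the
- program's actual C implementation: the function names, the operator
- precedence, the treatment of "!" as a binary "but not" operator (as
- in the example query), and prefix matching for "partial keys" are
- all assumptions made for the illustration.

```python
import re

TOKEN = re.compile(r"[()&|!]|[^\s()&|!]+")
# Assumed precedence: "&" and "!" bind tighter than "|".
PRECEDENCE = {"&": 2, "!": 2, "|": 1}

def to_postfix(query):
    """Convert an infix query to a postfix token list (shunting-yard)."""
    output, stack = [], []
    for tok in TOKEN.findall(query):
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()  # discard the matching "("
        elif tok in PRECEDENCE:
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok.lower())  # a keyword
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output

def relevance(postfix, text):
    """Score a text; 0 means the text does not satisfy the query."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    stack = []
    for tok in postfix:
        if tok in PRECEDENCE:
            right, left = stack.pop(), stack.pop()
            if tok == "&":
                stack.append(left + right if left and right else 0)
            elif tok == "|":
                stack.append(left + right)
            else:  # "!": keep the left score only if the right is absent
                stack.append(left if not right else 0)
        else:
            # Assumed "partial key" rule: a keyword matches any word
            # that begins with it.
            stack.append(sum(1 for w in words if w.startswith(tok)))
    return stack.pop()

def rank(query, documents):
    """Return the names of matching documents, most relevant first."""
    postfix = to_postfix(query)
    scored = [(relevance(postfix, text), name)
              for name, text in documents.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for score, name in scored if score > 0]

docs = {
    "a": "these and those and these",
    "b": "them and us but we",
    "c": "them us",
    "d": "nothing",
}
print(rank("(((these & those) | (them & us)) ! we)", docs))
```

- In this sketch, file "b" is rejected because it contains "we", and
- file "d" because it contains none of the keywords.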
-
- Applicability:
-
- The applicability of rel depends on the complexity of the search, the
- size of the database, the speed of the host environment, etc. As
- some general guidelines:
-
- 1) For text files with a total size of less than 5 MB, rel and
- standard egrep(1) queries of the text files will probably prove
- adequate.
-
- 2) For text files with a total size of 5 MB to 50 MB, qt seems
- adequate for most queries. The significant issue is that, although
- the retrieval execution times are probably adequate with qt, the
- database write times are not impressive. Qt is listed under "Related
- information retrieval software," below.
-
- 3) For text files with a total size larger than 50 MB, or where
- concurrency is an issue, it would be appropriate to consider one of
- the other alternatives listed under "Related information retrieval
- software," below.
-
- Extensibility:
-
- The source was written with extensibility in mind. To alter
- character transliterations, see uppercase.c for details. For
- enhancements to phrase searching and hyphenation suggestions, see
- translit.c.
-
- It is possible to "weight" the relevance determination of
- documents that are composed in one of the standardized general
- markup languages, like TeX/LaTeX or SGML. The "weight" of a search
- match depends on where the words are found in the structure of the
- document. For example, if the search was for "numerical" and
- "methods," \chapter{Numerical Methods} would be weighted "stronger"
- than \section{Numerical Methods}, which in turn would be weighted
- "stronger" than the same words found in a paragraph. This would
- permit the relevance of a document to be determined by how the
- author structured the document. See eval.c for suggestions.
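-
- A structure-sensitive weighting of this kind might look like the
- following Python sketch. It is only an illustration of the idea, not
- the contents of eval.c: the weight values, the names WEIGHTS,
- BODY_WEIGHT and weighted_count, and the restriction to \chapter and
- \section are all assumptions.

```python
import re

# Hypothetical weights; the only requirement taken from the text is
# chapter > section > ordinary paragraph text.
WEIGHTS = {"chapter": 4, "section": 2}
BODY_WEIGHT = 1

HEADING = re.compile(r"\\(chapter|section)\{([^}]*)\}")
WORD = re.compile(r"[a-z]+")

def weighted_count(word, latex_source):
    """Count occurrences of a word, weighted by where they appear."""
    word = word.lower()
    score = 0
    # Occurrences inside headings count with the heading's weight.
    for kind, title in HEADING.findall(latex_source):
        score += WEIGHTS[kind] * WORD.findall(title.lower()).count(word)
    # Everything outside the headings counts as body text.
    body = HEADING.sub(" ", latex_source)
    score += BODY_WEIGHT * WORD.findall(body.lower()).count(word)
    return score

in_chapter = r"\chapter{Numerical Methods} An introduction."
in_section = r"\section{Numerical Methods} An introduction."
in_paragraph = "An introduction to numerical methods."

for source in (in_chapter, in_section, in_paragraph):
    print(weighted_count("numerical", source))
```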
-
- The list of identifiers in the search argument can be printed to
- stdout, possibly preceded by a '+' character and separated by '|'
- characters to make an egrep(1) compatible search argument, which
- could, conceivably, be used as the search argument in a browser so
- that something like:
-
- "browse `rel arg directory`"
-
- would automatically search the directory for arg, load the files
- into the browser, and skip to the first instance of an identifier,
- with one button scanning to the next instance, and so on. See
- postfix.c for suggestions.
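-
- Building such a search argument from the query could be sketched as
- follows. This is a hypothetical Python illustration, not the code in
- postfix.c: the function name egrep_pattern and its prefix parameter
- are assumptions, and the identifiers are assumed to be plain words
- that need no escaping.

```python
import re

TOKEN = re.compile(r"[()&|!]|[^\s()&|!]+")
OPERATORS = set("()&|!")

def egrep_pattern(query, prefix=""):
    """Join the query's identifiers with '|', optionally prefixed."""
    identifiers = []
    for tok in TOKEN.findall(query):
        # Keep each identifier once, in order of first appearance.
        if tok not in OPERATORS and tok not in identifiers:
            identifiers.append(tok)
    return prefix + "|".join(identifiers)

print(egrep_pattern("(((these & those) | (them & us)) ! we)"))
```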
-
- The source architecture is highly modular, to facilitate adapting
- the program to different environments and applications. For
- example, a "mailbot" can be constructed by eliminating
- searchpath.c and constructing a list of postfix stacks, with
- perhaps an email address element added to each postfix stack, so
- that the program could be used to scan incoming mail; if the mail
- was relevant to any postfix criteria, it would be forwarded to the
- recipient.
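-
- The "mailbot" arrangement might be sketched as below. This is a
- hypothetical Python illustration rather than a description of the
- program's source: the criteria, the example.com addresses, the
- function names, and the scoring rules (binary "!" meaning "but
- not", prefix matching of keywords) are all assumptions.

```python
import re

OPERATORS = {"&", "|", "!"}

def relevance(postfix, text):
    """Evaluate a postfix stack against a text; 0 means not relevant."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    stack = []
    for tok in postfix:
        if tok in OPERATORS:
            right, left = stack.pop(), stack.pop()
            if tok == "&":
                stack.append(left + right if left and right else 0)
            elif tok == "|":
                stack.append(left + right)
            else:  # "!": left operand present, right operand absent
                stack.append(left if not right else 0)
        else:
            stack.append(sum(1 for w in words if w.startswith(tok)))
    return stack.pop()

# Each entry pairs a postfix stack with a recipient address.
CRITERIA = [
    (["directory", "listing", "&"], "alice@example.com"),
    (["retrieval", "spam", "!"], "bob@example.com"),
]

def recipients(message):
    """Return the addresses whose criteria the message satisfies."""
    return [address for postfix, address in CRITERIA
            if relevance(postfix, message) > 0]

print(recipients("A directory listing tool for information retrieval"))
```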
-
- The program is capable of running as a wide area, distributed,
- full text information retrieval system. A possible scenario would
- be to distribute a large database across many systems that are
- internetworked together, presumably via the Unix inet facility,
- with each system running a copy of the program. Queries would be
- submitted to the systems, and the systems would return individual
- records containing the count of matches to the query and the name
- of the file containing the matches, perhaps with the machine name,
- so that the records could be sorted on the "count field." A
- network wide "browser" could then be used to view the documents,
- or a script could be made to use the "r suite" to transfer the
- documents to the local machine. Obviously, the queries would be
- run in parallel on the machines in the network, so concurrency
- would not be an issue. See the function, main(), below, for
- suggestions.
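-
- The merging step of such a scenario could be sketched as follows.
- This Python sketch is only an illustration under assumptions: the
- Match record, the query_all and merge_results names, and the
- query_host callback (a stand-in for whatever network transport
- would actually be used) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import NamedTuple

class Match(NamedTuple):
    count: int      # number of matches to the query
    host: str       # machine that holds the file
    filename: str   # file containing the matches

def merge_results(per_host_results):
    """Merge per-host records and sort them on the count field."""
    merged = [m for results in per_host_results
              for m in results if m.count > 0]
    return sorted(merged, key=lambda m: m.count, reverse=True)

def query_all(hosts, query, query_host):
    """Run the query on every host in parallel and merge the records."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        per_host = list(pool.map(lambda h: query_host(h, query), hosts))
    return merge_results(per_host)

# Canned data standing in for real remote systems.
FAKE_HOSTS = {
    "hostA": [Match(3, "hostA", "a.txt"), Match(0, "hostA", "z.txt")],
    "hostB": [Match(7, "hostB", "b.txt")],
}

results = query_all(["hostA", "hostB"], "(directory & listing)",
                    lambda host, query: FAKE_HOSTS[host])
for match in results:
    print(match.count, match.host, match.filename)
```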
-
- References:
-
- 1) "Information Retrieval, Data Structures & Algorithms," William
- B. Frakes, Ricardo Baeza-Yates, Editors, Prentice Hall, Englewood
- Cliffs, New Jersey 07632, 1992, ISBN 0-13-463837-9.
-
- The sources for many of the algorithms presented in 1) are
- available by ftp, ftp.vt.edu:/pub/reuse/ircode.tar.Z
-
- 2) "Text Information Retrieval Systems," Charles T. Meadow,
- Academic Press, Inc, San Diego, 1992, ISBN 0-12-487410-X.
-
- 3) "Full Text Databases," Carol Tenopir, Jung Soon Ro, Greenwood
- Press, New York, 1990, ISBN 0-313-26303-5.
-
- 4) "Text and Context, Document Processing and Storage," Susan
- Jones, Springer-Verlag, New York, 1991, ISBN 0-387-19604-8.
-
- 5) ftp think.com:/wais/wais-corporate-paper.text
-
- 6) ftp cs.toronto.edu:/pub/lq-text.README.1.10
-
- Related information retrieval software:
-
- 1) Wais, available by ftp, think.com:/wais/wais-8-b5.1.tar.Z.
-
- 2) Lq-text, available by ftp,
- cs.toronto.edu:/pub/lq-text1.10.tar.Z.
-
- 3) Qt, available by ftp,
- ftp.uu.net:/usenet/comp.sources/unix/volume27.
-
- john@johncon.com (John Conover)
- Campbell, California, USA
- September, 1995
-